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Abstract 

We propose a novel algorithm to solve the expectation propagation relaxation of 
Bayesian inference for continuous-variable graphical models. In contrast to most previ- 
ous algorithms, our method is provably convergent. By marrying convergent EP ideas 
from [12] with covariance decoupling techniques [l9l [10] , it runs at least an order of 
magnitude faster than the most commonly used EP solver. 

1 Introduction 

A growing number of challenging machine learning applications require decision-making 
from incomplete data (e.g., stochastic optimization, active sampling, robotics), which relies 
on quantitative representations of uncertainty (e.g., Bayesian posterior, belief state) and 
is out of reach of the commonly used paradigm of learning as point estimation on hand- 
selected data. While Bayesian inference is harder than point estimation in general, it can 
be relaxed to variational optimization problems which can be computationally competitive, 
if only they are treated with the algorithmic state-of-the-art established for the latter. 

In this paper, we propose a novel algorithm for the expectation propagation (EP; or adap- 
tive TAP, or expectation consistent (EC)) relaxation [11^ [HI [T2]. which is both much faster 
than the commonly used sequential EP algorithm, and is provably convergent (the sequen- 
tial algorithm lacks such a guarantee). Our method builds on the convergent double loop 
algorithm of [12j, but runs orders of magnitude faster. We gain a deeper understanding of 
EP (or EC) as optimization problem, unifying it with covariance decoupling ideas [191 llOj . 
and allowing for "point estimation" algorithmic progress to be brought to bear on this 
powerful approximate inference formulation. 

Suppose that observations y G M*" are modelled as y = Xu + e, where u £ M" are latent 
variables of interest, £ ~ N{0,a'^I) is Gaussian noise, and X G M"*^" is the design matrix. 
For example, u can be an image to be reconstructed from y (e.g., Fourier coefficients in 
magnetic resonance imaging [18] ) , further examples are found in [15] . The prior distribution 
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has the form P{u) oc rTi=i^i(^*) with non-Gaussian potentials tj(-), and s := Bu for a 
matrix B. A well-known example are Laplace sparsity priors defined by ti{si) = e"^'!^'! [E], 
where -B collects simple filters (e.g., derivatives, wavelet coefficients). This formal setup 
also encompasses binary classification (u classifier weights, Yli=iti{si) the classification 
likelihood |T2j) or spiking neuron models [4J. The posterior distribution is 

P{u\y) = Z-^N{y\Xu, a^I) Jj" ^ U{s^), (1) 

-■- -'-1=1 

Z := J N{y\Xu,a'^I)Y[i=iti{si) du the partition function for P{u\y), and s = Bu. 
Bayesian inference amounts to computing moments of P{u\y) and/or logZ. Hyperpa- 
rameters / can be learned by maximizing logZ(/) ^ (e.g., motion deblurring by blind 
deconvolution [6]). In Bayesian experimental design (or active learning) [10], X is built up 
sequentially by greedily maximizing expected information scores. These applications require 
posterior covariance information beyond any single point estimate. 

The expectation propagation relaxation along with known algorithms is described in Sec- 
tion [2} scalable inference techniques reviewed in Section [Sj We develop our novel algorithm 
in Section |4| provide a range of real-world experiments (image deblurring and reconstruc- 
tion) in Section [sj and close with a discussion (Section |6]). Upon publication, code for our 
algorithm will be released into the public domain. 



2 Expectation Propagation 

Expectation propagation (EP) [8l [12] stands out among variational inference approxima- 
tions. First, it is more generally applicable than most others (see end of Section [3]). Second, 
a range of empirical studies indicate that EP can be a far more accurate approximation 
to Bayesian inference than today's competitors of comparable running time [5l [9]. Conse- 
quently, EP has been applied to a diverse range of modelsj^ On the other hand, EP is more 
difficult to handle than most other methods, for a number of reasons. It is not an optimiza- 
tion problem based on a bound on log Z ([T]), but constitutes a search for a saddle point [12j . 
Moreover, its stationary equations are more complicated in structure than commonly used 
bounds. Finally, running EP can be numerically challenging \T\. 

In the sequel, we describe the variational optimization problem behind (fractional) EP, 
details can be found in [SldSKTS]. The goal is to fit the posterior distribution P{u\y) from 
([l]) by a Gaussian of the form 

Qiu\y) := ZQlAf(y|X^t,cJ2/)e''^^-i^^(<^'^s'^)^ 
CovQ[u\y]-^ = A :=a-^X'^X + B^{diag7r)B, (2) 

where Zq := J N{y\Xu,a'^I)e^'^''-^^'^^'^'''^''> du, s = Bu. Qiu\y) depends on the vari- 
ational parameters b and vr ^ 0, collected as = (tt, b) below. Let marginal distri- 
butions N(fii,pi) be indexed by moment parameters /x, p, E (0,1] a fractional pa- 
rameter (while standard EP uses rj = 1, r] < 1 can strongly improve numerical stabil- 
ity [l5]). For i G {l,...,q}, denote Ki = Ki{si) := biSi — \'nisf. The cavity marginal is 

^ A comprehensive bibliography can be found at [reseaLrch. microsoft . com/en-us/vun/people/minka/l 
papers/ep/roadmap . html 
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Q-i{si) oc N{si\fii, pi)e~^'^' , the tilted marginal Pi{si) oc Q^i{si)ti{si)^ . While Pi{si) is not 
a Gaussian, its moments (mean and variance) can be computed tractably. An EP fixed 
point (vTjb) satisfies expectation consistency ^12j: if N{fii,pi) = Q{si\y), then -Pi(sj) and 
Q{si\y) have the same mean and variance for all i = 1, . . . , g. The corresponding (negative 
free) energy function is 

(p{7r,b,fx,p) := -2log Zq 

- ?ELi (logEQ.Jt.(..)''] -logEg.Je'''^^]) , 

where Zq is the partition function of Q{u\y) (see Eq. [2]). If we define fi, p in terms of tt, 
b (by requiring that N(fj.i,pi) = Q{si\y)), it is easy to see that Vt^^P = ^b'P = implies 
expectation consistency. However, this dependency tends to be broken intermediately in 
most EP algorithms. A schematic overview of the expectation consistency conditions is as 

follows (notations 0, s*, z are introduced in subsequent sections; denotes Gaussian 
moment matching): 

N{pi,pi) Q-i{si) cx N{si\pi,pi)e"^''' 

i (3) 
Q{si\y) < — > Pi{si) oc Q-i{si)ti{si)'' 

=N{s,i,Zi) 

The total criterion (j){7r, b, p{7v, b), p{7v, b)) is neither convex nor concave |12j. 

The most commonly used sequential EP algorithm visits each potential i G {1, . . . ,q} in 
turn, first updating //j, pi, then vTj, bi based on one iteratioE0 of = = [SI [H]. 
For models of moderate size n, a numerically robust implementation maintains the inverse 
covariance matrix A ([2]) as representation of Q{u\y). A sweep over all potentials costs 
0{q'n?). If memory costs of O(n^) are prohibitive, we can determine pi on demand by 
solving a linear system with A, in which case a sweep requires q such systems. The sequential 
EP algorithm is too slow to be useful for many applications. Notably, all publications for 
EP we are aware of (with the exception of two references discussed in the sequel) employ 
this method, generally known as "the EP algorithm". 

In |3], a parallel variant of EP is applied to rather large models of a particular structure. 
They alternate between updates of all /x, p and all tt, fo, the latter by one iteration of 
d-Trcj) = db4> = (these equations decouple w.r.t. i = l,...,g). The most expensive step 
per iteration by far is the computation of marginal variances p, which is feasible only for 
the very sparse matrices A specific to their application. Neither sequential nor parallel 
algorithm come with a convergence proof. 

A provably convergent double loop algorithm for EP is given by Opper&Winther in |12j . For 
its derivation, we need to consider a natural parameterization of the problem. The underlying 
reason for this is that log partition functions like log Zq ^ are simple convex functions in 
natural parameters, and derivatives w.r.t. the latter result in posterior expectations. Collect 
= (tt, 6) and recall that Kj = — ^vTjS?. Let = (7r,&) be natural parameters corre- 
sponding to p,p {iTi = l/pi, bi = p.ilpi), and ki = biSi-\kisl, so that N{si\p,i,pi) = Zr^e'^S 

^ "One iteration" means solving for tt;, 6i, assuming that the cavity distribution Q-^i{si) is fixed (ignoring 
its dependence on tt;, bi). 
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where Zi = J e'^' dsi is the normahzation constant. With 9 = (7r_,fo_) = — rj9 and 
K-i = h-iSi — \-K-is1 = Ri — r]Ki, we have that Q-i{si) oc e**-' and Pi{si) = Zj^^e'^-Hi{si)^ 
withZ, = / e'''%{s,)^ds,.U<t>n{0-,0) := -^^M Zi-2log Zq cind M^) ■.= ^Ei^ogZi, 
we have that (p{6-,d) = 0n(0-,6') + 4'\j{6), where (j)f]{6-,0) is jointly concave]^ while 
(pui^) is convex. Define 0(0) := maxg (j){d^,6). The Opper&Winther algorithm (locally) 
minimizes (j){0) via two nested loops. The inner loop (IL) is the concave maximization 
6 ^ ar gmaxQ _ (f)r\{0-,0) for fixed 0. An outer loop (OL) iteration consists of an IL 
followed by an update of 0: /x ^ EQ[s|y], p ^ VarQ[s|y]. Within the schema ([s]), the 

IL ensures expectation consistency -f^^ in the lower row, while the OL update equates 
marginals in the left column. While this algorithm provably converges to a stationary point 
of (piO) whenever the criterion is lower bounded |12j . it is expensive to run, as variance 
computations VarQ[s|y] are required frequently during the IL optimization (convergence 
and properties are discussed in the Appendix). Finally, since 9 = r]~^{0 — 0_), concave 
maximization w.r.t. 9- for fixed 9 can equivalently be seen as concave maximization w.r.t. 
9. We will do the latter for notational convenience in the sequel. 



3 Scalable Variational Inference 

Scalable algorithms for a variational inference relaxation]^ different from EP have been 
proposed in |10^ [T6] (this relaxation is called VB in the sequel, for "Variational Bound- 
ing"). They can be used whenever all potentials are super-Gaussian, meaning that ti{si) = 

max7rj>o e^''^'"^'^''*^"'^''''^''*''^ for some hi{-Ki), which implies the bound — 21ogZ < (I)^^{tt) := 
—2log Zq + h{7r) on the log partition function of P{u\y) (up to an additive constant), 
where h{7v) := 'Yli^ii'^i)- Note that in this relaxation, b is fixed up front (fo = if all 
potentials ti{si) are even), and tt are the sole variational parameters. They proceed in two 
steps. First, — 21ogZQ = log|A| + min^^ i?(7r, b, n*) (up to an additive constant), where 
i?(7r, b,u^,) := (T~^||y — Xtt=i,|p+sJ'(diag 7r)s=K — 2fo^s*, s^, = Bu^. Second, since tt i— )• log \A\ 
is a concave function, Fenchel duality |14| ch. 12] implies that log |A| = min^ z^tt — g*{z) 
for some g*{z). The variational problem becomes 

min(/.V^(7r) (4) 

= min min iv — g* {z) + R{tz ^b^u^) + h{T^). 

It is solved by a double loop algorithm, alternating between inner loop (IL) minimizations 
w.r.t. TT, for fixed z and outer loop (OL) updates of z and g*{z). 

The important difference to both the double loop algorithm of |12] and the parallel algo- 
rithm of [3j lies in the decowp/mg transformation log |A| = min^ z'^iT—g*{z). (j)^^{7v) is hard 
to minimize due to the coupling term log | A|. For example, Vtt log |A.| = diag{B B^) = 
YaTQ[s\y] requires Gaussian variance computations, which are very expensive in practice 
|16) . But log|A| is replaced by a fixed linear function in each IL problem, where we can 
eliminate tt analytically and are left with a penalized least squares problem of the form 

^ Log partition functions (logZi, logZg) are convex in their natural parameters, and 6 — ri^^{6 — 0-) 
is linear. 

In contrast to EP, this relaxation is convex iff all ti{si) are log-concave [lU] . 
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miiiu, cr~^l|y ~ -'Cw^lp — 6asy to solve with standard algorithms that do not 

need Gaussian variances at all. To understand the decoupling transformation more gen- 
erally, consider minimizing Q w.r.t. each variable in turn, keeping the others fixed. The 
solutions are it* = Eq[w|i/] (means) and z = Vn-log|A| = VarQ[s|y] (variances). The 
role of decoupling is to split between computations of means and variances |16) : the latter, 
much more expensive to obtain in general, are required at OL update points only, much less 
frequently than the former (means) which are obtained by solving a single linear system. 

Note that several important models come with potentials which are not super- Gaussian (e.g., 
Poisson potentials for spiking neuron models [3], or potentials like the exponential, which 
become zero), but can easily be handled with EP. Moreover, EP seems to be substantially 
more accurate as approximation to Bayesian inference [5l [9] . To construct an efficient EP 
solver, we have to make use of decoupling in a similar fashion, so to minimize the number 
of Gaussian variances computations, while retaining provable convergence. 



4 Speeding up Expectation Propagation 

A fast and convergent EP algorithm is obtained by marrying the double loop algorithm of 
|12j with the decoupling trick of |10j . During its course, (or fi, p) will mainly be fixed, and 
we will drop it from notation accordingly (but recall that the Zi depend on it). Moreover, 
we will typically work with = [0 — 0-)/rj rather than 6-. Then, 

</'n(0) (5) 
= min — g*{z) + R{'iT,h,u^) — 2i]^^y^ logZj 

^ V ' 

=:(j>r\{v ,0) , ti=(z,u,) 

= mina~^||2/ - Xw*||^ - ilji{s^i,TTi,hi) - g*{z), 

il^i := -{zi + sli)TTi + 2biS*i + 2r]^^ log Zi. 

With V = (jz,u*) and (pni^) = niin^ (pi-^{9,v), the IL problem of |12j is max^ min^, (pf^. As 
shown in the Appendix, (/>n(^, v) is a closed proper concave-convex function (convex in v for 
each 0, concave in 9 for each v) [T3]. Strong duality holds: max^ min^ 0n = min^ maxg (pf-], 
so the IL problem is equivalent to 

min ( minu^^lly - Xtt*||^ - T^. ■(/'i(s*i) ) - g*{z), 

Z \ It* ^ J 

ipi{si) := min ■ipi{si,-Ki,bi). (6) 

This problem is jointly convex in z, tt* (note that ^i{s^:i) is concave as minimum of concave 
functions, and the minimization over TTi,bi is a jointly convex problem). Solving the inner 
problem of ([g]) for fixed jz is a simple and very efficient penalized least squares building block, 
denoted by (w*, 9) PLS(z, 9) in the sequel. Note that at its solution, it* = EQ[it|i/], where 
Q{u\y) is indexed by 9. 

This means that the problem addressed in [12] can be written in the form min^ g ^(z, 9). 

The significance is the same as in Section jsj both cj){z, 9) and min^ 4>{z, 9) (local minimum) 
for fixed z can be determined very efficiently. The dominating cost of computing Gaussian 
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variances is concentrated in the update of z. Two main ideas lead to the algorithm we 
propose here. First, we descend on (l){z,6) rather than 4>{9) = mmz(j){z,0) [12], saving 
on variance computations. One iteration of our method determines z VarQ[s|2/], then a 
local minimum min^j (p{z, 0) in a convergent way. Empirically, such "optimistic" iterations 
seem to always descend on (p{z,0) until convergence to a stationary point of </>(0), but 
just as for the sequential or parallel algorithm, we cannot establish this rigorously. At this 
point, the second idea is to rely on the inner loop optimization of \T1\ in order to enforce 
descent eventually. We obtain a provably convergent algorithm by combining optimistic 
steps miUg (j){z , 6) for fixed z with the rigorous but slow mechanism of [12J. As most, if 
not all optimistic steps produce sufficient descent in practice, provable convergence comes 
almost for free (in contrast to p[2], where it carries a large price tag). 

To flesh out this notion, denot^ <l){0,z,6) = z'^n — g*{z) + {miUu, R{TT,h,u^)) — 
27/"^ log Zj, and (j){z,6) = maxg 0(0, z, 0). Note that (j){9,9) = min^ 0(0, z, 0), more- 
over maxg 0(0,0) = min^ max0 0(0, z, 0) by strong duality. First, 0(0) < (p{z,6), so that 
(f)(z,0) is lower bounded if 0(0) is (which, like |12|, we assume). Next, as shown in the 
Appendix, we can very efficiently minimize 0(z,0) locally w.r.t. by setting p z, then 
iterating between {u^,6) ^ PLS(z,0) and ^ ^ s^: = Bu^, = EQ[s|y]. In the sequel, we 
denote this subalgorithm by 0' ^ updateTTil(z, 0). While updateTTil may call PLS mul- 
tiple times, it does not require expensive Gaussian variance computations. An "optimistic" 
step of our algorithm updates z' ^ VarQ[s|y], then 0' ^ updateTTil(z', 0), at the cost 
of one variance computation. Within the schema ([s]), we update z, set p z, then attain 
expectation consistency and /x = EQ[s|y] = s^, = Bu^, for fixed variances z, p. 

Suppose we are at a point z, (and 0), so that is a local minimum point of 0(z, 0). How 
can we descend: 0(z', 0') < 0(z, 0) unless is a stationary point of 0(0)? Let 0^^^ = 0. The 
optimistic step would be z^^) = VarQ[s|y], then 0' ^ updateTTil(z(^\ 0). If 0(z(^\0') is 
sufficiently smaller than 0(z, 0), we are done with our descent step: z' = z^^\ Otherwise, we 
run one iteration 0^^^ — )• 0^^^ of the inner optimization max^ <j){6,0) of [12]. This requires 
variance computations, while z^^^ can be reused (and z^^^^ may already be computed). We set 
^ 0^^) and attempt another optimistic step: z^'^\ updateTTil(z(^), 0). Without interven- 
ing descent, we would eventually obtain 0^'^^ = max^ 0(0, 0), thus z*^*^) = argmin^/ 0(z', 0). 

If no descent happens from there, must be a stationary point of 0(0) (see ^12j and Ap- 
pendix). 

Note that in most cases in practice, our algorithm does not run into the inner optimization 
of [12] even once (unless to confirm final convergence). Yet the possibility of doing so is 
what makes our convergence proof work. Algorithm [T] provides a schema. 

A word of warning about the inner optimization max^ 0(0,0). From it is tempting 
to iterate between z ^ YarQ[s\y] and {u^:,0) PLS(z,0). However, this does not lead 
to descent and typically fails in practice. As seen in Section [3| the update of z serves to 
refit an upper bound, suitable for minimizing, but not maximizing over 0. In our algorithm, 
this problem is compensated by the minimization over 0: optimistic steps seem to always 
descend. 

^ In the sequel, we will eliminate u*_by minimization in our notation. Since strong duality holds, we can 
move min„, outside when solving PLS |6| at any time (for fixed z, 0). 
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Algorithm 1 Double loop EP algorithm. 

The part shaded in grey was never accessed in our experiments (see text for comments). 

A(o,6) := (6-a)/max{|a|,|6|,10-9}. 
Iterate over z,6 {fi,p). 
repeat 
7r(i) =7r. 

for k = 1,2, . . . do 
z^^^ = Ya-r Q[s\y]. 
{6', 6') ^ updateTTil(jz('=), 0). 
if /^{(p{z^^\e'),(t){z,e)) > e then 

Sufficient descent: z ^ z^''\ 6 ^ 6', 

6^6'. Leave loop over k. 
else 

Run iteration of max^ 4>{6, 0) : 
oik) _^ 0{k+i)_ Set d ^ 

if \A{(P{e^''+^\e),(P{e^''\e))\ < e then 

Converged to stationary point 0: Terminate algorithm, 
end if 
end if 
end for 

until Maximum number of iterations done 



4.1 Computational Details 

In this section, we provide details for computational primitives required in Algorithm [T} 
First, we show how to efficiently compute PLS, i.e. solve the inner problem in Q for fixed 
.z 0. As all are concave, this is a convex penalized least squares problem, for 

which many very efficient solvers are available. A slight technical challenge comes from the 
implicit definition of the regularizer: evaluating and its derivatives entails a bivariate 
convex minimization. 

In our experiments, we employ a standard gradient-based Quasi-Newton optimizer. Sup- 
pose we are at u^: and have determined the maximizer 6 = {7v,b). If f{u^) = a^'^\\y — 
- ^iipi{s^.i), then = 9s,-^/^i(s*i, vrj, ftj) = 2(6j - vTjS^j), so that Vu^f{u^) = 

2(j~'^X'^{Xu^ - y) + 2B^{n o - b), at the cost of one matrix-vector multiplication 
(MVM) with X'^X, , B respectively (here, "o" denotes the componentwise product). 
For the bivariate minimizations, the derivatives are cJfc-V'i = 2{s^i — Ep [sj]), d-j^^tpi = 

— {zi + s^j) + Ep [s^]: we have to adjust 6i,7rj so that mean and variance of Pi coincides 
with Sifi and Zj. Details for the computation of Pi are given in [15]. In our implementation, 
we initialize the minimization by two standard EP updates, then run Newton's algorithm 
(details are given in a longer paper). Even for large q, these bivariate minimizations can of- 
ten be done more rapidly than MVMs with X^X. Moreover, they can be solved in parallel 
on graphics hardware. 

The inner optimization max^ </>(0, 6) of [T2] can be addressed by any convex solver. We 
employ Quasi-Newton once more. The gradients are db4>{Tr,b,d) = 2((Ep [s,;]) — (Egfsi])), 
dT^(t){'K,b,6) = (Eq[s^]) — (Ep [s^]). This computation entails z = VarQ[s|y]. Note that 
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with a standard solver, a sufficient increase in (f){6, 0) (for fixed 0) may require a number 
of VarQ[s|y] computations. We are not aware of an effective way to decouple this problem 
as in Section [3l 



Gaussian Variances 

Finally, how do we compute Gaussian variances z = Ya.VQ[s\y] = diaig{BA^^B^)? This 
is by far the most expensive computation in all EP algorithms discussed here: our main 
contribution is a novel convergent algorithm which requires few of these calls. In our exper- 
iments, n is a few thousand, q ~ 3n, and we can maintain an n x n matrix in memory. We 
use the identity 

z = diag (^BA-^ S^dfB^ = Y^XBA'^Si) o (BSi), 

where 6i = (I|j=j})j. We compute the Cholesky decomposition A = LL^ , then from 
L, using LAPACK code, then accumulate z by 2n MVMs with B. 

If n is larger than 10^ or so, this approach is not workable anymore. If A is very sparse, it may 
possess a sparse Cholesky decomposition which can be determined efficiently, in which case 
z is determined easily [3j . However, for typical image reconstruction models, X is dense. For 
the VB relaxation of Section [3j variances have been approximated by the Lanczos algorithm 
|18t [Tn| . It is noted in [TB] that variances are strongly (but selectively) underestimated in this 
way, and consequences for the VB double loop algorithm are established there: in a nutshell, 
while outcomes are qualitatively different, the algorithm behaviour remains reasonable. In 
contrast, if any of the EP algorithms discussed in this paper are run with Lanczos variance 
approximations, they exhibit highly erratic behaviour. Parallel EP [3J rapidly diverges, 
our variant ends in numerical breakdown. While we are lacking a complete explanation for 
these failures at present, it seems evident that the expectation consistency conditions, whose 
structure is more complicated than the simple VB bound, do not tolerate strong variance 
errors. Our observation underlines the thesis of [16]. Robustness to variance errors of the 
kind produced by Lanczos becomes an important asset of variational inference relaxations, 
at least if large scale inference is to be addressed. The EP relaxation, as it stands, does 
not seem to be robust in this sense. Explaining this fact, and possibly finding a robust 
modification of the expectation consistency conditions, remain important topics for future 
research. 



5 Experiments 

5.1 Expectation Propagation vs. VB 

In the following experiment, we compare approximate inference outcomes of EP (Section [2]) 
and VB (Section |3]), complementing previous studies O |9]. We address the (non-blind) 
deconvolution problem for image deblurring (details ommitted here are found in [6j): u E M" 
represent the desired sharp image, X = (diag f)Fn, where is the nx n discrete Fourier 
transform (DFT)[^ / = F^f the spectrum of the blur kernel /, and y = Fny, y the 

Strictly speaking, we encode C by R^, and F„ is the "real-to-complex" DFT (closely related to the 
discrete cosine transform). Both / and y axe Hermitian and can be stored as R" vectors. 
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blurry image. Our model setup is similar to what was previously used in [16]: P{u) is a 
Laplace sparsity prior (see Section [T]), the transform B consists of an orthonormal wavelet 
transform Ba and horizontal/vertical differences B^ ("total variation"), corresponding prior 
parameters are Ta, Tr. Recall that b is fixecQ depending on the ti{-) in VB: since they are 
even, 6 = 0. In contrast, they are free variational parameters in EP. Posterior marginals, 
as approximated by EP and VB, are shown in Figure [T| while we compare parameters b, tt 
in Figure [2| 

The EP and VB approximations are substantially different. While the means are visually 
similar, EP's posterior variances are larger and show a more pronounced structure. An 
explanation is offered by the striking differences in final parameters b, tt. Roughly, vTj scales 
the degree of penalization of Si [TS]. While both EP and VB strongly penalize certain 
coefficients, VB (in contrast to EP) seems to universally penalize all Si (all vtvb.i > 10), 
thus may produce small variances simply by overpenalization. EP clearly makes use of b, 
which allow to control the posterior mean independent of the covariance: a mechanism not 
available for VB. It is important to note that our findings are in line with those in [9], who 
found that VB strongly underapproximated marginal variances (they obtained the ground 
truth by expensive Monte Carlo simulations). As noted in Section [T| it is often the posterior 
uncertainty estimates (covariances) which give Bayesian decision-making an edge over point 
estimation approaches. 



5.2 EP Timing Comparison 

In this section, we provide timing comparisons between EP algorithms discussed in this 



paper. Our setup is much the same as in Section |5.1[ but both the choice of X and data 
is taken from |16j . The problem is inference over images u € M"' from "Cartesian MRI" 
measurements (discrete Fourier coefficients) y E C", so that X = Ij^.Fn, where J is 
an index selecting acquired coefficients (in fact, complete columns in DF space ("phase 
encodes") are sampled, according to a design optimized for natural images). The prior is 
the same as used above. 

In our first experiment, we use 64 x 64 images (n = 4096, q = 12160) and a design X 
sampling 16 columns (m = 1024, 4 times undersampled) . We compare the sequential and 
parallel EP algorithms with our novel fast (convergent) EP method. We chose not to include 
results for the double loop algorithm of [H], since it runs even slower than the sequential 



method (see comments in Section 4.1). Our results are averaged over 20 different images 
(the y vectors are noisy acquisitions, = 10~^, but the same across methods). Moreover, 
Ta = 0.04/(T, Tr = 0.08/(7 (same values as in [l6]). Timing runs were done on an otherwise 
unloaded standard desktop machine. For each run, we stored tupels (Tj,(pj) at the end of 
each outer iteration (for sequential EP, this is a sweep over all potentials), Tj elapsed time 
(in sees), cpj the EP energy value attained. On a fixed image, all methods eventually attained 
the same energy valu^ (say, 0^,), and we show (Tj, \ — (j)^) / . Results are presented 
in Figure [3| left. First, the sequential algorithm is not competitive with the others. At a 
time when the others converged, it is roughly 1/4 through its first sweep (while requiring 
about four sweeps to converge). Second, the parallel and our fast EP algorithm converge 
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''' This is an inherent feature of the variational bound, which would cease to be valid if b were optimized 
While this is not guaranteed by present EP convergence theory, it happened in all our cases. 
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in about the same time. However, ours does so much more smoothly and attains a near 
optimal solution more quickly. 

In a second experiment, we use a single 128 x 128 image (n = 16384, q = 48896) and a 
design X sampling 36 columns (~ 3.5 times undersampled) . We compare the parallel with 
our fast EP algorithm, since the sequential method is clearly infeasible at this scale. Here, 
o"^ = 2 • 10~^, Ta = 0.04/(7, Tr = 0.08/(7. Results are presented in Figure [sj right. On this 
larger problem, our algorithm converges significantly faster. 

Our method (fast EP in Figure |3]) is provably convergent, while parallel EP (and sequential 
EP) lacks such a guarantee. Beyond, the main difference between fast and parallel EP lies in 
how thoroughly variance computations are exploited. Fast EP spends more effort between 
them, solving min^ (p{z,6) = min^ max^ (p(6,z,d), while parallel EP simply does a single 
EP update. Our method therefore incurs an overhead, which motivates the results for 64x64 
images. However, this overhead is modest (each step of PLS costs 0(g + nlogn)), while the 
cost for variances, at 0(n(n^ + g)), grows very fast. The overhead for fast EP pays off in the 
128 X 128 image example, due to the fact that it requires about two variance computations 
less than parallel EP to attain convergence. Notably, the overhead cost can still be greatly 
reduced by running different algorithms (see Section [6]) or parallelizing the computations of 
the V'i(**i)) which is not done in our implementation. 

6 Discussion 

We proposed a novel, provably convergent algorithm to solve the expectation propagation 
relaxation of Bayesian inference. Based on the insight that the most expensive computations 
by far in any variational method concern Gaussian variances, we exploit a decoupling trick 
previously used in ^191 [TO] in order to minimize the number of such computations. Our 
method is at least an order of magnitude faster than the commonly used sequential EP 
algorithm, and improves on parallel EP [3], the previously fastest solver we are aware of, 
both in running time and guaranteed convergence. Moreover, it is in large parts similar 
to recent algorithms for other relaxations |,10j. which allows for transfer of efficient code. 
While the sequential EP algorithm is most widely used today, our results indicate that this 
is wasteful even for small and medium size problems and should be avoided in the future. 

There are numerous avenues for future work. First, for problems of the general form dis- 
cussed in Section [5| the central penalized least squares primitive PLS could be solved more 
efficiently by employing modern augmented Lagrangian techniques, such as the ADMM al- 
gorithm reviewed in [2] (today's most efficient sparse deconvolution algorithms are based on 
this technique), and by parallelizing the innermost bivariate optimization problems leading 
to ipi{s^i) and its derivatives. Such measures would bring down the (already modest) over- 
head of our technique, compared to parallel EP. Moreover, we aim to resolve whether the 
"optimistic steps" our algorithm is mainly based on, provably lead to descent by themselves 
(this would render the fallback on [12], shaded in Algorithm [T| obsolete, thus simplify the 
code). 

Known EP algorithms (including ours presented here) break down in the presence of sub- 
stantial Gaussian variance approximation errors, in contrast to algorithms for simpler re- 
laxations which behave robustly. If real-world Bayesian image applications such as those in 
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Section [5] are to be run at realistic sizes, variance errors cannot be avoided. The most im- 
portant future direction is therefore to understand the reason for this non-robustness of EP 
algorithms (or even the expectation-consistency conditions as such) and to seek for alter- 
natives which combine the accuracy of this relaxation with good behaviour in the presence 
of typical Gaussian variances approximation errors |16j . 



Appendix 

We start by reviewing the convergence proof for the EP double loop algorithm of Section [2] 
|12j . The problem is min^ maxg (/>n(0-, 0) + (j)u{6). Now, 0n(^) = niax0_ (j)Q{0^, 6) is con- 
cave. If 0_ = argmin(/)(6>_,6»), then (f>{9') < R{e') := (l)n{0-,0)-g'^{e' -e) + (f>u{0'), where 
g = -^eMO) = -dg(t>n{e-,e) pi, ch. 12]. if = ri-\9 - then g = O^llogZQ = 
r]^^{EQ[s\y], —^EqIs'^Iu]). Now, 0u(^) = R{^)-, and R{0') is convex, its minimum defined 
by V Q,(l)\j{0') = g. Therefore, minimizing R{0') leads to (j){0') < 4>{0), unless g = V q4>\j{6), 

thus V Q(j){6) = 0. Since the sequence 4){0) is nonincreasing and lower bounded, it must 
converge to a stationary point. To determine g, note that if u^^ is the minimizer in 
then EQ[s|y] = = Bit* and EQ[s^|y] = si + VarQ[s|7/]. Moreover, since </>(0') is the sum 
of log partition functions of N{fii,pi), the equation V Q,(j)yj{6') = g is solved by = s*, 
p' = VarQ[s\y]. 

Importantly, exactly the same argument establishes the convergence (to a stationary point) 
of miug, (j){z, 6') for any fixed z y 0, thus the computation of updateTTil in Section 4 We 

only have to replace log |A(7r)| by z'^tv — g*{z) (both are concave in 6, therefore concave 
in {9-,0)), noting that the gradient w.r.t. tt changes from V7rlog|A| = VarQ[s|i/] to 
V-„{z^7v — g*{z)) = z. The only difference to the algorithm of [12J just discussed is that p 
is updated to z, not to VarQ[s|y], so that variances do not have to be computed. 

Next, we establish the properties of the inner loop problem max^ i;^n(^) ^) (Eqs. [sj 
In particular, we prove that strong duality holds. Recall that v = {z,u^:) and (f)r\{v-,0) 
from p]). We begin by extending (j)r\{v,6) for all values of z and tt [H]. First, g*{z) = 
inf,r ^^r— log I A(7r)| is the conca?;e dual function of log | A (7r)|. Since log|A| — )■ oo whenever 
any vrj — )• oo [l7], then g*{z) — oo as any Zi \ 0, and <^n := +oo if any Zi < 0. Moreover, 
4>ri '■= —oo if 2 ;^ and any VTj < 0, and </>n(i', tt, b) := limTj-x^^ (pr,{v,7r, b) for any tt ^ 0. 
With these extensions, it is easy to see that (j)f^{v,9) is a closed proper concave-convex 
function [14] ch. 33]: convex in v for each 0, concave in for each v. Note that we always 
have that maxg min„ (/ip < min„ maxg (weak duality) . In order to establish equality 
(strong duality), we show that (j)f^[-,9) do not have a common nonzero direction of recession. 
Given that, strong duality follows from \\.4\ Theorem 37.3]. 

Theorem 1 Let 4>{v,6) be defined as in and extended to a closed proper concave- 
convex function. If 9 = (tt, b) is such that tv y and A(7r) is positive definite, then 0(-, 6) 
has no nonzero direction of recession. For any d ^ and any v so that (f){v, 6) < oo: 

^.^ ^{v + td,e)-c^{v,e) ^ ^ 

t-^oo t 



Proof Write F{v) = (/>n(i', 0) for brevity, and pick any d 7^ 0. d is a direction of recession 
iff limt^oc{F{v + td) — F{v))/t < for some v [HI Theorem 8.5]. Pick any v = (z,m*). 
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z >~ 0, and let d = {dz,du). If du ^ 0, then F{v + td) = Q.{t^) by the positive definite 
quadratic part. If {dz)i < for any i, then there is some to > so that {z + tdz)i is negative 
and F{v + td) = oo for all t > to. This leaves us with du = 0, ^ 0, so that {dz)i > 
for some i. Let tt = tt — {7Ti/2)Si. By definition, g*{z + tdz) < {z + tdz)^TT — log |A(7r)|, 
therefore 

F{v + td)-F{v) nc ^g*{z)-g*{z + tdz) 
t ^ t 

^^T( , g*{z)+ \og\A{iT)\- z'^iT 

> (TT - TTj H 



--TTi{dz)i/2 + 



t 

g*{z) + \og\A{^)\- z"^^ 



t 

which is positive as t — )• oo. 
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sharp image: u convolution filter: f blurred image: y 




VB posterior mean 




VB posterior standard dev 




EP posterior mean 




EP posterior standard dev 




Figure 1: Deconvolution setting and resulting marginals (variances on u, not on s). u is 
48 X 73, pixels, the kernel / is 22 x 25 (n = 3504, q = 10512, = = 15, = IQ-^). 
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Convergence of EP algorithms: 64x64 example 




Figure 3: Timing comparison of EP algorithms for inference over greyscafe images. Left: 
64 X 64 images. Right: 128 x 128 image. Shown is relative distance to EP energy stationary 
point Kf/* — I as function of running time (left: mean, two std. over 20 different images). 
Algorithms: sequential EP (Section [2j left only), parallel EP (Section [2]), and fast EP (our 
method). 
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